A Comparison of String Distance Metrics for Name-Matching Tasks

Authors

  • William W. Cohen
  • Pradeep Ravikumar
  • Stephen E. Fienberg
Abstract

Using an open-source Java toolkit of name-matching methods, we experimentally compare string distance metrics on the task of matching entity names. We investigate a number of different metrics proposed by different communities, including edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. Overall, the best-performing method is a hybrid scheme combining a TFIDF weighting scheme, which is widely used in information retrieval, with the Jaro-Winkler string-distance scheme, which was developed in the probabilistic record linkage community.
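To make the Jaro-Winkler component concrete, here is a minimal Python sketch of the commonly described Jaro similarity with Winkler's common-prefix adjustment. It is an illustration only, not the toolkit's Java implementation: the function names, the prefix scale of 0.1, and the prefix cap of 4 characters are assumptions taken from the usual textbook formulation, and the TFIDF half of the hybrid scheme is not shown.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means the strings are identical."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Characters count as matching only if they are equal and lie
    # within this many positions of each other.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = True
                t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order are
    # counted in pairs as transpositions.
    s_seq = [c for c, m in zip(s, s_matched) if m]
    t_seq = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (
        matches / len(s)
        + matches / len(t)
        + (matches - transpositions) / matches
    ) / 3.0


def jaro_winkler(s: str, t: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix.

    prefix_scale and max_prefix are assumed defaults, not values
    taken from the paper.
    """
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


if __name__ == "__main__":
    print(round(jaro("MARTHA", "MARHTA"), 4))
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

In a hybrid scheme of the kind the abstract describes, a character-level score like this would typically be used to compare individual tokens, with TFIDF weights deciding how much each token pair contributes to the overall name similarity.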

Similar resources

Usability of String Distance Metrics for Name Matching Tasks in Polish

This paper presents the results of numerous experiments on the usability of well-established string distance metrics, and some new variants thereof, for various name-matching tasks in Polish.

A Comparison of String Metrics for Matching Names and Records

We describe an open-source Java toolkit of methods for matching names and records. We summarize results obtained from using various string distance metrics on the task of matching entity names. These metrics include distance functions proposed by several different communities, such as edit-distance metrics, fast heuristic string comparators, token-based distance metrics, and hybrid methods. We ...

Chronic Treatment by L-NAME differently Affects Morris Water Maze Tasks in Ovariectomized and Naïve Female Rats

Introduction: The roles of ovarian hormones and nitric oxide (NO) in learning and memory, and their interaction, have been widely investigated. The present study was carried out to evaluate the different effects of L-NAME on spatial learning and memory in ovariectomized (OVX) and sham-operated rats. Methods: 32 rats were divided into 4 groups: 1) Sham 2) OVX 3) Sham-LN and 4) OVX-LN. The animals of groups 3...

Real World Performance of Approximate String Comparators for use in Patient Matching

Medical record linkage is becoming increasingly important as clinical data is distributed across independent sources. To improve linkage accuracy we studied different name comparison methods that establish agreement or disagreement between corresponding names. In addition to exact raw name matching and exact phonetic name matching, we tested three approximate string comparators. The approximate...

A Comparison of String Distance Metrics on Usernames for Cross-Platform Identification

People often use similar usernames across different social media sites. This fact can be used to correlate accounts between different platforms. Since the first mention of this fact in 2009, no research has been done on how to exploit it most efficiently. We showed that ignoring the casing will most definitely improve the matching and we found that Smith-Waterman provides the best metric to matc...

Publication date: 2003